mean_pinball_loss (Pinball / Quantile Loss)#

Mean pinball loss (also called pinball loss or quantile loss) measures the average error of quantile predictions.

If you predict an \(\alpha\)-quantile (e.g. \(\alpha=0.9\) for the 90th percentile), pinball loss answers:

“How far off are my quantile predictions, with the correct asymmetric cost?”

It is the standard loss behind quantile regression and a building block for prediction intervals (fit two quantiles, e.g. 0.05 and 0.95).


Learning goals#

  • Define pinball loss precisely (with math)

  • Build intuition for the asymmetric penalty controlled by \(\alpha\)

  • Implement mean_pinball_loss from scratch in NumPy (including weights / multi-output)

  • Visualize why the minimizer is the \(\alpha\)-quantile

  • Use mean pinball loss to fit a simple linear quantile regressor (subgradient descent)

  • Understand pros/cons, pitfalls, and best use cases

import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio

from scipy.stats import norm
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_pinball_loss

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition#

For a target \(y\) and a prediction \(\hat{y}\), the pinball loss at quantile level \(\alpha \in [0, 1]\) is:

\[\begin{split} \ell_{\alpha}(y, \hat{y}) = \begin{cases} \alpha\,(y - \hat{y}) & \text{if } y \ge \hat{y} \\ (1-\alpha)\,(\hat{y} - y) & \text{if } y < \hat{y} \end{cases} \end{split}\]

Equivalently, using the residual \(u = y - \hat{y}\):

\[ \ell_{\alpha}(y, \hat{y}) = \alpha\,\max(u, 0) + (1-\alpha)\,\max(-u, 0) \]

Or with an indicator:

\[ \ell_{\alpha}(y, \hat{y}) = \bigl(\alpha - \mathbf{1}\{y < \hat{y}\}\bigr)\,(y - \hat{y}) \]

For \(n\) samples, mean pinball loss is:

\[ \mathrm{MPL}_{\alpha}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} \ell_{\alpha}(y_i, \hat{y}_i) \]
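The three forms above are algebraically identical. A quick numerical check on synthetic data (a sketch, not part of any library API):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1_000)
y_hat = y + rng.normal(scale=0.5, size=1_000)
alpha = 0.7

u = y - y_hat
# case form
case = np.where(u >= 0, alpha * u, (1 - alpha) * (-u))
# max form
max_form = alpha * np.maximum(u, 0) + (1 - alpha) * np.maximum(-u, 0)
# indicator form: (alpha - 1{y < y_hat}) * (y - y_hat)
ind = (alpha - (y < y_hat)) * u

assert np.allclose(case, max_form) and np.allclose(case, ind)
```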

Connection to MAE#

At \(\alpha = 0.5\), the loss is symmetric and proportional to absolute error:

\[ \ell_{0.5}(y, \hat{y}) = 0.5\,|y-\hat{y}| \]

So it has the same minimizer as MAE (the median), and its value is exactly half the MAE.

Weighted version#

With non-negative sample weights \(w_i\):

\[ \mathrm{MPL}_{\alpha,w}(y, \hat{y}) = \frac{\sum_{i=1}^{n} w_i\,\ell_{\alpha}(y_i, \hat{y}_i)}{\sum_{i=1}^{n} w_i} \]

The loss is in the same unit as \(y\), is always non-negative, and the best value is \(0\).
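A direct transcription of the weighted formula; the helper name `weighted_mpl` is introduced here purely for illustration. As a sanity check, duplicating a sample should be equivalent to doubling its weight:

```python
import numpy as np

def weighted_mpl(y_true, y_pred, alpha, w):
    # weighted mean pinball loss, as defined above
    u = y_true - y_pred
    loss = alpha * np.maximum(u, 0) + (1 - alpha) * np.maximum(-u, 0)
    return float(np.sum(w * loss) / np.sum(w))

rng = np.random.default_rng(2)
y = rng.normal(size=100)
y_hat = y + rng.normal(scale=0.4, size=100)
w = rng.uniform(0.5, 2.0, size=100)

# duplicate the first sample...
y2 = np.concatenate([y, y[:1]])
y_hat2 = np.concatenate([y_hat, y_hat[:1]])
w2 = np.concatenate([w, w[:1]])
# ...versus doubling its weight
w_dup = w.copy()
w_dup[0] *= 2

assert np.isclose(weighted_mpl(y2, y_hat2, 0.8, w2),
                  weighted_mpl(y, y_hat, 0.8, w_dup))
```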

# A tiny example: alpha controls the asymmetry

y_true = np.array([1.0, 2.0, 3.0])

# under-predict the first sample vs over-predict the last sample
pred_under = np.array([0.0, 2.0, 3.0])
pred_over = np.array([1.0, 2.0, 4.0])

for alpha in [0.1, 0.5, 0.9]:
    loss_under = mean_pinball_loss(y_true, pred_under, alpha=alpha)
    loss_over = mean_pinball_loss(y_true, pred_over, alpha=alpha)
    print(f"alpha={alpha:.1f} | loss(under)={float(loss_under):.4f} | loss(over)={float(loss_over):.4f}")
alpha=0.1 | loss(under)=0.0333 | loss(over)=0.3000
alpha=0.5 | loss(under)=0.1667 | loss(over)=0.1667
alpha=0.9 | loss(under)=0.3000 | loss(over)=0.0333

2) Intuition: a “tilted” absolute error#

Pinball loss is a piecewise-linear function of the prediction \(\hat{y}\).

  • If you under-predict (\(\hat{y} < y\)), the loss slope is \(-\alpha\).

  • If you over-predict (\(\hat{y} > y\)), the loss slope is \(+(1-\alpha)\).

So \(\alpha\) chooses which mistake is more expensive:

  • \(\alpha=0.9\) (upper quantile): under-prediction is expensive, over-prediction is cheap → pushes predictions upward.

  • \(\alpha=0.1\) (lower quantile): over-prediction is expensive, under-prediction is cheap → pushes predictions downward.

# Visualize the loss for a single sample

y0 = 2.0

yhat_grid = np.linspace(y0 - 4, y0 + 4, 400)

fig = go.Figure()
for alpha in [0.1, 0.5, 0.9]:
    diff = y0 - yhat_grid
    loss = alpha * np.maximum(diff, 0) + (1 - alpha) * np.maximum(-diff, 0)
    fig.add_trace(go.Scatter(x=yhat_grid, y=loss, mode="lines", name=f"alpha={alpha}"))

fig.add_vline(x=y0, line_dash="dash", line_color="gray")
fig.update_layout(
    title="Pinball loss for one sample (y = 2.0)",
    xaxis_title="prediction ŷ",
    yaxis_title="loss ℓα(y, ŷ)",
)
fig.show()

3) Why it targets quantiles#

Consider a constant predictor \(\hat{y}=c\) (no features).

A key fact:

The value of \(c\) that minimizes the expected pinball loss is an \(\alpha\)-quantile of \(Y\).

Sketch (subgradient): for one sample,

  • if \(c < y\) then \(\ell_\alpha(y,c) = \alpha(y-c)\) and \(\partial_c \ell_\alpha = -\alpha\)

  • if \(c > y\) then \(\ell_\alpha(y,c) = (1-\alpha)(c-y)\) and \(\partial_c \ell_\alpha = (1-\alpha)\)

So the expected subgradient at \(c\) is:

\[ \mathbb{E}[\partial_c\,\ell_\alpha(Y,c)] = (1-\alpha)\,\mathbb{P}(Y < c) - \alpha\,\mathbb{P}(Y > c) \]

The optimum is where this crosses \(0\): setting it to zero gives \(\mathbb{P}(Y < c) \le \alpha \le \mathbb{P}(Y \le c)\), i.e. \(c\) is an \(\alpha\)-quantile of \(Y\) (for continuous \(Y\), simply \(\mathbb{P}(Y \le c) = \alpha\)).

In a finite sample, minimizing mean pinball loss over \(c\) yields a sample \(\alpha\)-quantile (when \(\alpha n\) is an integer, the minimizer can be an interval between two order statistics).
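This can also be checked without a grid by handing the one-dimensional convex objective to a generic minimizer (a sketch; `mpl_const` is a helper introduced here):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
y = rng.lognormal(sigma=0.6, size=2_000)

def mpl_const(c, y, alpha):
    # mean pinball loss of the constant predictor c
    u = y - c
    return float(np.mean(alpha * np.maximum(u, 0) + (1 - alpha) * np.maximum(-u, 0)))

for alpha in [0.1, 0.5, 0.9]:
    res = minimize_scalar(mpl_const, bounds=(float(y.min()), float(y.max())),
                          args=(y, alpha), method="bounded")
    # the minimizer should sit (essentially) on the sample quantile
    print(f"alpha={alpha:.1f}  argmin={res.x:.4f}  sample quantile={np.quantile(y, alpha):.4f}")
```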

# Demo: the best constant predictor is the sample quantile

n = 500
# A skewed distribution to make the shift visible
y = rng.lognormal(mean=0.0, sigma=0.6, size=n)

alphas = [0.1, 0.5, 0.9]

lo, hi = np.quantile(y, [0.01, 0.99])
c_grid = np.linspace(lo, hi, 400)

fig = go.Figure()

for alpha in alphas:
    diff = y[:, None] - c_grid[None, :]
    loss = alpha * np.maximum(diff, 0) + (1 - alpha) * np.maximum(-diff, 0)
    mpl = loss.mean(axis=0)

    q = float(np.quantile(y, alpha))
    c_star = float(c_grid[np.argmin(mpl)])

    fig.add_trace(go.Scatter(x=c_grid, y=mpl, mode="lines", name=f"alpha={alpha}"))
    fig.add_vline(x=q, line_dash="dash", line_color="gray")
    fig.add_vline(x=c_star, line_dash="dot", line_color="gray")

fig.update_layout(
    title="Constant prediction: minimizer is the α-quantile",
    xaxis_title="constant prediction c",
    yaxis_title="mean pinball loss",
)
fig.show()

4) NumPy implementation (from scratch)#

Below is a small implementation that mirrors scikit-learn’s API shape:

  • supports sample_weight

  • supports multi-output targets ((n_samples, n_outputs))

  • supports multioutput='raw_values' or averaging across outputs

def _as_2d(a):
    a = np.asarray(a)
    if a.ndim == 1:
        return a.reshape(-1, 1)
    return a


def mean_pinball_loss_np(
    y_true,
    y_pred,
    *,
    alpha=0.5,
    sample_weight=None,
    multioutput="uniform_average",
):
    y_true = _as_2d(y_true)
    y_pred = _as_2d(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    if not (0.0 <= alpha <= 1.0):
        raise ValueError("alpha must be in [0, 1]")

    diff = y_true - y_pred
    loss = alpha * np.maximum(diff, 0.0) + (1.0 - alpha) * np.maximum(-diff, 0.0)

    # average over samples -> per-output errors
    output_errors = np.average(loss, weights=sample_weight, axis=0)

    if multioutput == "raw_values":
        return output_errors

    if multioutput == "uniform_average":
        return float(np.mean(output_errors))

    # multioutput is array-like weights for outputs
    multioutput = np.asarray(multioutput)
    return float(np.average(output_errors, weights=multioutput))


# Quick equivalence checks with scikit-learn

y_true = rng.normal(size=50)
y_pred = y_true + rng.normal(scale=0.5, size=50)
w = rng.uniform(0.5, 2.0, size=50)

for alpha in [0.1, 0.5, 0.9]:
    a0 = float(mean_pinball_loss(y_true, y_pred, alpha=alpha))
    b0 = mean_pinball_loss_np(y_true, y_pred, alpha=alpha)

    a1 = float(mean_pinball_loss(y_true, y_pred, alpha=alpha, sample_weight=w))
    b1 = mean_pinball_loss_np(y_true, y_pred, alpha=alpha, sample_weight=w)

    print(f"alpha={alpha:.1f} | unweighted diff={a0-b0:+.2e} | weighted diff={a1-b1:+.2e}")

# Multi-output example
Y_true = rng.normal(size=(40, 2))
Y_pred = Y_true + rng.normal(scale=0.3, size=(40, 2))

mean_pinball_loss_np(Y_true, Y_pred, alpha=0.5, multioutput="raw_values")
alpha=0.1 | unweighted diff=+0.00e+00 | weighted diff=+0.00e+00
alpha=0.5 | unweighted diff=+0.00e+00 | weighted diff=+0.00e+00
alpha=0.9 | unweighted diff=+0.00e+00 | weighted diff=+0.00e+00
array([0.1162, 0.1234])

5) Using mean pinball loss to optimize a model (linear quantile regression)#

Let a linear model be:

\[ \hat{y} = Xw + b \]

Quantile regression fits the \(\alpha\)-quantile by minimizing mean pinball loss:

\[ \min_{w,b}\; \frac{1}{n}\sum_{i=1}^{n} \ell_{\alpha}\bigl(y_i, x_i^\top w + b\bigr) \]

For linear models this objective is convex, but it is not differentiable at \(y=\hat{y}\). A simple low-level optimizer is subgradient descent.

Subgradient w.r.t. prediction#

For a single sample, with \(u = y - \hat{y}\):

\[\begin{split} \frac{\partial \ell_{\alpha}}{\partial \hat{y}} = \begin{cases} -\alpha & u > 0 \\ (1-\alpha) & u < 0 \\ [-\alpha,\; 1-\alpha] & u = 0 \end{cases} \end{split}\]

Then the parameter subgradients are:

\[ \nabla_w L = \frac{1}{n}X^\top g, \qquad \nabla_b L = \frac{1}{n}\sum_{i=1}^{n} g_i \]

where \(g_i\) is a chosen subgradient for sample \(i\).

def fit_linear_quantile_regression_subgd(
    X,
    y,
    *,
    alpha=0.5,
    lr=0.2,
    n_iters=3000,
    log_every=50,
):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)

    n, d = X.shape
    w = np.zeros(d)
    b = 0.0

    steps = []
    losses = []

    for t in range(1, n_iters + 1):
        y_pred = X @ w + b
        diff = y - y_pred

        # subgradient wrt prediction ŷ
        g = np.where(diff > 0, -alpha, 1.0 - alpha)
    g = np.where(diff == 0, 0.0, g)  # 0 is a valid subgradient at the kink (-alpha <= 0 <= 1-alpha)

        grad_w = (X.T @ g) / n
        grad_b = float(np.mean(g))

        lr_t = lr / np.sqrt(t)  # diminishing step size
        w -= lr_t * grad_w
        b -= lr_t * grad_b

        if t % log_every == 0 or t == 1:
            y_pred = X @ w + b
            losses.append(mean_pinball_loss_np(y, y_pred, alpha=alpha))
            steps.append(t)

    return w, b, np.array(steps), np.array(losses)


# Synthetic heteroscedastic data where the true conditional quantiles are linear
n = 600
x = rng.uniform(0, 10, size=n)

beta0 = 1.0
beta1 = 2.0
sigma0 = 0.5
sigma1 = 0.2  # noise scale increases with x

sigma = sigma0 + sigma1 * x
noise = rng.normal(loc=0.0, scale=sigma, size=n)

y = beta0 + beta1 * x + noise
X = x.reshape(-1, 1)

alphas = [0.1, 0.5, 0.9]

fits = {}
for a in alphas:
    w, b, steps, losses = fit_linear_quantile_regression_subgd(X, y, alpha=a)
    fits[a] = {"w": w, "b": b, "steps": steps, "losses": losses}

{k: (v['w'][0], v['b']) for k, v in fits.items()}
{0.1: (1.7306963005697273, 0.22191021030730798),
 0.5: (2.0163591712333595, 0.8064745345330414),
 0.9: (2.259015414769386, 1.3579550011540444)}
# Visualize fitted quantile lines

x_grid = np.linspace(x.min(), x.max(), 250)
X_grid = x_grid.reshape(-1, 1)

colors = {0.1: "#1f77b4", 0.5: "#2ca02c", 0.9: "#d62728"}

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=x,
        y=y,
        mode="markers",
        name="data",
        marker=dict(size=6, opacity=0.45),
    )
)

for a in alphas:
    w, b = fits[a]["w"], fits[a]["b"]
    y_hat = X_grid @ w + b
    fig.add_trace(
        go.Scatter(
            x=x_grid,
            y=y_hat,
            mode="lines",
            name=f"fit α={a}",
            line=dict(color=colors[a], width=3),
        )
    )

# (Optional) true quantile lines (we know the data-generating process)
for a in alphas:
    z = norm.ppf(a)
    y_true_q = (beta0 + z * sigma0) + (beta1 + z * sigma1) * x_grid
    fig.add_trace(
        go.Scatter(
            x=x_grid,
            y=y_true_q,
            mode="lines",
            name=f"true α={a}",
            line=dict(color=colors[a], width=2, dash="dash"),
            opacity=0.85,
        )
    )

fig.update_layout(
    title="Linear quantile regression trained with mean pinball loss",
    xaxis_title="x",
    yaxis_title="y",
)
fig.show()
# Training curves

fig = go.Figure()
for a in alphas:
    fig.add_trace(
        go.Scatter(
            x=fits[a]["steps"],
            y=fits[a]["losses"],
            mode="lines",
            name=f"α={a}",
            line=dict(width=3, color=colors[a]),
        )
    )

fig.update_layout(
    title="Subgradient descent on mean pinball loss",
    xaxis_title="iteration",
    yaxis_title="mean pinball loss",
)
fig.show()

Sanity check: scikit-learn’s QuantileRegressor#

QuantileRegressor solves the same optimization problem (with optional regularization) using linear programming. We’ll fit it (without regularization) and compare parameters and scores.

sk_fits = {}
for a in alphas:
    model = QuantileRegressor(quantile=a, alpha=0.0, solver="highs")
    model.fit(X, y)
    sk_fits[a] = model

for a in alphas:
    w_subgd, b_subgd = fits[a]["w"][0], fits[a]["b"]
    w_sk, b_sk = sk_fits[a].coef_[0], sk_fits[a].intercept_

    print(
        f"α={a}: subGD w={w_subgd:.3f}, b={b_subgd:.3f} | sklearn w={w_sk:.3f}, b={b_sk:.3f}"
    )
α=0.1: subGD w=1.731, b=0.222 | sklearn w=1.743, b=0.147
α=0.5: subGD w=2.016, b=0.806 | sklearn w=2.010, b=0.842
α=0.9: subGD w=2.259, b=1.358 | sklearn w=2.218, b=1.594

6) Practical usage: scoring quantile predictions#

In practice you typically:

  • train a model to predict a specific quantile (say \(\alpha=0.9\))

  • evaluate it with mean_pinball_loss(y_true, y_pred, alpha=0.9)

If you predict multiple quantiles, compute the metric for each \(\alpha\) separately.

# Score subgradient-descent vs sklearn solutions on the same data

for a in alphas:
    y_hat_subgd = X @ fits[a]["w"] + fits[a]["b"]
    y_hat_sklearn = sk_fits[a].predict(X)

    score_subgd = mean_pinball_loss(y, y_hat_subgd, alpha=a)
    score_sklearn = mean_pinball_loss(y, y_hat_sklearn, alpha=a)

    print(f"α={a}: mean_pinball_loss subGD={float(score_subgd):.4f} | sklearn={float(score_sklearn):.4f}")
α=0.1: mean_pinball_loss subGD=0.2667 | sklearn=0.2666
α=0.5: mean_pinball_loss subGD=0.5943 | sklearn=0.5942
α=0.9: mean_pinball_loss subGD=0.2589 | sklearn=0.2564

7) Pros / cons and when to use#

Pros#

  • Targets quantiles directly → prediction intervals and risk-sensitive forecasting

  • Asymmetric costs: tune \(\alpha\) to match business penalties (stockouts vs overstock, SLA breaches, etc.)

  • Robust to outliers compared to squared losses (linear penalty in the tails)

  • For linear models, the objective is convex (global optimum)

Cons#

  • Non-smooth at \(y=\hat{y}\) → optimization uses subgradients / linear programming and may be slower

  • Requires choosing \(\alpha\) (often multiple values)

  • Not a single-number summary of full predictive uncertainty (you need multiple quantiles)

  • Fitting multiple quantiles independently can lead to quantile crossing (e.g. \(\hat{q}_{0.9}(x) < \hat{q}_{0.5}(x)\))

Good use cases#

  • Forecasting with uncertainty bands (delivery times, demand, energy load)

  • Finance / risk (Value-at-Risk style quantiles)

  • Any regression where over- vs under-prediction costs differ

8) Pitfalls & diagnostics#

  • Scale dependence: like MAE, mean pinball loss is in target units; compare across datasets only after scaling/normalizing.

  • Choose \(\alpha\) intentionally: align it with decisions (e.g. \(\alpha=0.9\) for “plan conservatively”).

  • Quantile crossing: when fitting multiple \(\alpha\) values, check ordering; consider joint methods that enforce monotonicity.

  • Coverage check: if you fit lower/upper quantiles, verify empirical coverage (e.g. fraction of points between \(\alpha=0.05\) and \(0.95\) predictions).

9) Exercises#

  1. Implement mean_pinball_loss_np without np.maximum (use only np.where).

  2. For a fixed dataset, plot the constant-prediction loss curves for \(\alpha \in \{0.1, 0.5, 0.9\}\) on the same chart.

  3. Fit two quantile regressors (0.1 and 0.9) and compute the empirical coverage of the resulting interval.

  4. Add L2 regularization to the subgradient descent objective and observe the effect on the fit.

References#

  • scikit-learn: sklearn.metrics.mean_pinball_loss

  • Koenker, R. (2005). Quantile Regression. Cambridge University Press.